2  Importing Data into R

2.1 Learning Outcomes

By the end of this section, you should:

  • understand the most common formats in which data is stored;

  • understand how to import datasets in various formats into R, ready for preparation and analysis.

2.2 Data Formats

When working in R (or any other software package), we usually begin by importing data that is stored in a format other than R’s native storage format.

We also may need to export our data outside of R, in a format that can be sent to another analyst or for backup purposes.

It’s important therefore to understand the most common formats in which data is stored, and be aware of their strengths and weaknesses.

CSV (Comma-Separated Values)

CSV is a simple, plain-text format where data is organised in rows, with each column separated by a comma. CSV files are very widely used due to their simplicity and ease of use. They can be read and written by many programs, including Microsoft Excel, Numbers, and Google Sheets.

Most of the data files you will handle during the MSc will be in this format.

Some advantages of CSV files include:

  • CSV files are easy to read and write.

  • It’s usually quite straightforward to open a .csv file and understand what it contains.

  • They’re widely supported by various software programmes and languages.

  • Importantly, in an era of ‘big data’, they are generally smaller in file size, and therefore more portable, than (for example) Excel files.

Disadvantages of CSV files include their limited support for complex data structures, no built-in data type support, and no support for multiple sheets like we find in Excel.
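Because CSV is just plain text, you can create and read one with nothing more than base R. Here's a minimal sketch, using a tiny made-up league table (the team names and points are for illustration only):

```r
# A CSV file is just plain text: write a tiny table to a temporary file,
# then read it back with base R's read.csv()
csv_text <- "Pos,Team,Pts\n1,Arsenal,73\n2,Manchester City,67"
tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)

league <- read.csv(tmp)
league$Team   # "Arsenal" "Manchester City"
```

If you open the temporary file in a text editor, you'll see exactly the three lines of text we wrote, which is what makes CSV so easy to inspect and share.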

Excel (XLS and XLSX)

Microsoft Excel files (XLS and XLSX) are a popular choice for data storage and analysis, especially for smaller datasets and non-technical users.

Excel files offer built-in data types, formulas, formatting, and support for multiple sheets within a single file. You’ll frequently encounter Excel being used to create and store data within sports settings as it’s the programme that many analysts and coaches are most familiar with.

  • Excel has a number of strengths, including support for complex data structures and data types, built-in formulas and functions, and widespread usage and familiarity.

  • There are also lots of online tutorials on using Excel, including some quite complex data processing.

Some disadvantages include it being a proprietary format that requires specific software to access. Excel files tend to be larger than the equivalent CSV files, and the format has limited support in some programming languages.

Note

One of the major disadvantages of working in Excel is a lack of transparency and replicability in data cleaning and preparation. For example, it’s hard to maintain a record of which variables you’ve deleted, or where you’ve changed a missing value.

For that reason, I’d recommend avoiding Excel if possible, unless it’s just being used for some basic data viewing.

JSON (JavaScript Object Notation)

JSON is a lightweight, text-based data interchange format that is easy for humans to read and write and easy for machines to parse and generate. It is often used to transmit data between a server and a web application, or for data storage, and is a popular format for working with APIs and web services.

If you’re unfamiliar with JSON, you can see some examples of it here.

Some advantages of JSON are that it is human-readable and easy to understand. It supports complex and nested data structures, and is widely supported by various programming languages.

Some disadvantages are that, like Excel, files in this format tend to have a larger file size compared to CSV, and it offers no support for tabular data structure out-of-the-box.
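The jsonlite package (covered below) makes it straightforward to parse JSON in R. Here's a small sketch using a hypothetical player record to show how nested JSON maps onto an R list:

```r
library(jsonlite)

# A small, hypothetical example of nested JSON describing one player
player_json <- '{
  "name": "A. Example",
  "team": "Arsenal",
  "appearances": 30,
  "goals": {"home": 5, "away": 3}
}'

# fromJSON() converts the JSON object into a named R list,
# with the nested "goals" object becoming a nested list
player <- fromJSON(player_json)
player$team          # "Arsenal"
player$goals$home    # 5
```

Notice that the nested structure survives the conversion, which is exactly the kind of thing a flat CSV file cannot represent.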

XML (eXtensible Markup Language)

XML is a markup language designed for encoding data in a format that is both human-readable and machine-readable. XML is often used for data exchange between systems and applications. It can represent complex hierarchical data structures and allows for custom tags and attributes.

Some advantages of XML are that it is human-readable, highly flexible and extensible, and supports complex and nested data structures.

Some disadvantages are that it generates larger file sizes compared to CSV and JSON. It also has a more complex syntax compared to JSON, and is less intuitive for data analysis tasks.

You can compare some data in both JSON and XML format here.
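If you do need to read XML in R, the xml2 package (not listed in the section below, but widely used) handles it well. Here's a sketch using the same hypothetical player record, this time with a custom tag and attribute:

```r
library(xml2)

# The same hypothetical player record as XML, using custom tags
# and a "team" attribute on the root element
player_xml <- '<player team="Arsenal">
  <name>A. Example</name>
  <appearances>30</appearances>
</player>'

doc <- read_xml(player_xml)
xml_attr(doc, "team")                           # "Arsenal"
xml_text(xml_find_first(doc, "//appearances"))  # "30" - text, not a number
```

Note that values come back as character strings; unlike JSON, XML has no native notion of a number, so you would need to convert types yourself.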

Parquet, Avro, and ORC

You may be unfamiliar with these formats. They are binary file formats optimised for big-data processing and are commonly used in the Hadoop and Spark ecosystems.

These formats offer a number of advantages over others, including columnar storage, data compression, and schema evolution capabilities, which together provide improved performance and storage efficiency for large-scale data processing.

However, they have limited support in non-big-data software tools and languages and, being binary formats, are not human-readable. It’s unlikely you will use them within the MSc programme, though certain placement providers may be using them within big-data applications.

2.3 Packages for Importing Data

As you might expect, there are a number of useful packages we can use in R that are designed specifically for importing raw data.

Tip

Just a reminder from last week - packages are additional libraries that you can install and run within R. They provide additional functionality over and above ‘base’ R. Remember, you only need to install a package once on each computer you use. You DO need to load it (using the library() command) each time you wish to use it in a session.

Here are some of the most common packages that are used for data import. Their names give you a clue about which format they can be used for:

  • tidyverse
  • readxl
  • jsonlite
  • RMySQL
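The install-once, load-every-session pattern from the tip above can be sketched as follows (the install line is commented out here, since you only run it the first time):

```r
# Check which of this section's packages are missing from this machine
pkgs <- c("tidyverse", "readxl", "jsonlite", "RMySQL")
missing <- setdiff(pkgs, rownames(installed.packages()))

# Install once per machine - uncomment the next line the first time only
# install.packages(missing)

# Loading, by contrast, is needed in every new R session:
# library(tidyverse)
# library(readxl)
```

The setdiff() check is a convenient way to avoid reinstalling packages that are already present.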

Working with CSV Data

# We can use a function from the tidyverse package to read a .csv file


# first I load the library
library(tidyverse) 
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# then I tell R where the .csv file is stored
url <- 'https://www.dropbox.com/scl/fi/6pctzyujt0mgvxvxbtmvw/ah_data_02.csv?rlkey=fqbc76dlqzp22ie6qm3urd3jv&dl=1'

# then I create new object (a dataframe) based on the downloaded .csv file
csv_data <- read_csv(url)
Rows: 26 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Team
dbl (9): Pos, Pl, W, D, L, F, A, GD, Pts

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# then I use the `head` function to inspect the first 6 rows of data
head(csv_data)
# A tibble: 6 × 10
    Pos Team                 Pl     W     D     L     F     A    GD   Pts
  <dbl> <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1 Arsenal              30    23     4     3    72    29    43    73
2     2 Manchester City      29    21     4     4    75    27    48    67
3     3 Newcastle United     29    15    11     3    48    21    27    56
4     4 Manchester United    29    17     5     7    44    37     7    56
5     5 Tottenham Hotspur    30    16     5     9    55    42    13    53
6     6 Aston Villa          30    14     5    11    41    40     1    47
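The message above suggests specifying the column types ourselves. Here's a sketch of how that looks with read_csv(), shown on a small literal CSV string (with made-up values) rather than the downloaded file; declaring types up front also silences the message:

```r
library(readr)  # loaded automatically as part of the tidyverse

# Declare the column types explicitly: Team is character,
# everything else defaults to double
mini <- read_csv("Pos,Team,Pts\n1,Arsenal,73",
                 col_types = cols(Team = col_character(),
                                  .default = col_double()))
mini$Pts   # 73
```

Declaring column types is good practice with larger files, because read_csv() will then warn you if a value doesn't match the type you expected, rather than silently guessing.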

Working with Excel Data

# We can use readxl to read an Excel file
# Note that, unlike the previous example, this script downloads the file first, THEN reads it into the environment

library(readxl)

# URL to the Excel file
url <- "https://www.dropbox.com/scl/fi/sjsoq83upq2nrwlgyt42d/ah_week6.xlsx?rlkey=4fettq4hnmxow03bfv36hqvod&dl=1"

# Download the file temporarily
temp_file <- tempfile(fileext = ".xlsx")
download.file(url, destfile = temp_file, mode = "wb")

# Read the Excel file
excel_data <- read_excel(temp_file)
New names:
• `` -> `...1`
# View the data
print(excel_data)
# A tibble: 100 × 11
    ...1    xpos ypos           velocity position group shots_on shots_off goal 
   <dbl>   <dbl> <chr>             <dbl> <chr>    <chr>    <dbl> <chr>     <lgl>
 1     1 -0.560  -0.7104065636…   0.543  A        A           18 19        FALSE
 2     2 -0.230  0.25688370915…  -0.414  B        D            4 17        TRUE 
 3     3  1.56   -0.2466918784…  -0.476  B        B           10 8         FALSE
 4     4  0.0705 -0.3475425993…  -0.789  D        NA          15 15        TRUE 
 5     5  0.129  -0.9516185672…  -0.595  C        D            8 14        TRUE 
 6     6  1.72   -4.5027724808…   1.65   B        NA           1 14        TRUE 
 7     7  0.461  -0.7849044694…  -0.0540 B        A           11 17        FALSE
 8     8 -1.27   -1.6679419365…   0.119  B        B            5 1         TRUE 
 9     9 -0.687  NA               0.244  B        C           11 NA        TRUE 
10    10 -0.446  0.91899660906…   1.23   B        A           15 18        TRUE 
# ℹ 90 more rows
# ℹ 2 more variables: finalthird <chr>, date <chr>
# Clean up by deleting the temp file
unlink(temp_file)

2.4 Practical Activity

Refer to the webpage here. Work through the activities listed, including how to determine your current working directory, which will be important in the future.